Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.
# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Sun Dec 11 14:23:19 2022"
How are you feeling right now? - I’m feeling a bit overwhelmed but hopeful.
What do you expect to learn? - Running data analysis with R.
Where did you hear about the course? - This course was recommended in Kimmo’s other courses.
How did the R for Health Data Science book and the Exercise Set 1 work as a “crash course” on modern R tools and using RStudio? - This was an interesting addition to the course materials. I like how the book and the exercise file are synchronized.
Which were your favorite topics? - I found the different methods for filtering data useful.
Which topics were most difficult? - Joining multiple data sets.
Some other comments on the book and our new approach of getting started with R Markdown etc.? - With little background in working with R, there seems to be a lot to learn. Luckily there is no need to memorize everything.
Link to GitHub repository: https://github.com/AapoVir/IODS-project.git
Describe the work you have done this week and summarize your learning.
date()
## [1] "Sun Dec 11 14:23:19 2022"
#install.packages("GGally")
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
1. Read the students2014 data into R from your local data folder
learning2014 <- read_csv("data/learning2014.csv")
## Rows: 166 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): gender
## dbl (6): age, attitude, deep, stra, surf, points
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#Explore the structure and the dimensions of the data and describe the dataset briefly
dim(learning2014)
## [1] 166 7
#The dataset has 7 variables (columns) and 166 observations (rows)
str(learning2014)
## spc_tbl_ [166 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ gender : chr [1:166] "F" "M" "F" "M" ...
## $ age : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
## $ points : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
## - attr(*, "spec")=
## .. cols(
## .. gender = col_character(),
## .. age = col_double(),
## .. attitude = col_double(),
## .. deep = col_double(),
## .. stra = col_double(),
## .. surf = col_double(),
## .. points = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
# The dataset has six numeric variables (age, attitude, deep, stra, surf, points) and one character/categorical variable (gender).
2. Show a graphical overview of the data
ggpairs(learning2014, mapping = aes(col = gender,alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
# Show summaries of the variables in the data
summary(learning2014)
## gender age attitude deep
## Length:166 Min. :17.00 Min. :1.400 Min. :1.583
## Class :character 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333
## Mode :character Median :22.00 Median :3.200 Median :3.667
## Mean :25.51 Mean :3.143 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083
## Max. :55.00 Max. :5.000 Max. :4.917
## stra surf points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
type_sum(learning2014)
## [1] "spc_tbl_[,7]"
Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them.
Women outnumber men in the cohort. The distribution of age is skewed to the right: ages range between 17 and 55 years, with a mean of 25.51 and a median of 22.00 years, the mean being pulled up by a long upper tail of older students. Attitude and points are positively correlated. Surface and strategic learning are negatively correlated.
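One quick numerical check of the direction of skew, assuming `learning2014` is loaded as above, is to compare the mean and the median:

```r
# A right-skewed distribution has its mean pulled above the median
# by the long upper tail; compare the two for age:
mean(learning2014$age)    # 25.51 (matches the summary above)
median(learning2014$age)  # 22
```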
3. Choose three variables as explanatory variables and fit a regression model where exam points is the target (dependent, outcome) variable.
#access library
library(ggplot2)
#Fit a regression model with surface and deep learning scores and age as explanatory variables and points as the outcome variable
lm1 <- lm(points ~ surf+deep+age, data = learning2014)
lm1
##
## Call:
## lm(formula = points ~ surf + deep + age, data = learning2014)
##
## Coefficients:
## (Intercept) surf deep age
## 33.24979 -2.03323 -0.70497 -0.08905
#Show a summary of the fitted model and comment and interpret the results. Explain and interpret the statistical test related to the model parameters.
summary(lm1)
##
## Call:
## lm(formula = points ~ surf + deep + age, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.0771 -3.2767 0.2549 4.1802 10.6178
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.24979 5.07515 6.551 7.22e-10 ***
## surf -2.03323 0.91707 -2.217 0.028 *
## deep -0.70497 0.86668 -0.813 0.417
## age -0.08905 0.05910 -1.507 0.134
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.835 on 162 degrees of freedom
## Multiple R-squared: 0.03794, Adjusted R-squared: 0.02012
## F-statistic: 2.129 on 3 and 162 DF, p-value: 0.09855
The summary describes a multiple linear regression model. Surface learning correlates negatively with the outcome variable ‘points’, and this association is statistically significant (p = 0.028). Neither deep learning nor age has a statistically significant association with the outcome variable.
#If an explanatory variable in your model does not have a statistically significant relationship with the target variable, remove the variable from the model and fit the model again without it.
# Repeating the model without 'deep' and 'age'.
lm2 <- lm(points ~ surf, data = learning2014)
lm2
##
## Call:
## lm(formula = points ~ surf, data = learning2014)
##
## Coefficients:
## (Intercept) surf
## 27.202 -1.609
4. Using a summary of your fitted model, explain the relationship between the chosen explanatory variables and the target variable (interpret the model parameters). Explain and interpret the multiple R-squared of the model.
#Taking the summary of the new model.
summary(lm2)
##
## Call:
## lm(formula = points ~ surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.6539 -3.3744 0.3574 4.4734 10.2234
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.2017 2.4432 11.134 <2e-16 ***
## surf -1.6091 0.8613 -1.868 0.0635 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.851 on 164 degrees of freedom
## Multiple R-squared: 0.02084, Adjusted R-squared: 0.01487
## F-statistic: 3.49 on 1 and 164 DF, p-value: 0.06351
Removing ‘deep’ and ‘age’ from the model renders the association of surface learning with ‘points’ no longer statistically significant (p = 0.06351).
R-squared is a measure of goodness of fit for linear models. It tells to what extent the explanatory variables explain the variation in the outcome variable. The multiple R-squared of 0.02084 (adjusted R-squared 0.01487) indicates that only about 2 % of the variance in ‘points’ is explained by ‘surf’.
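For intuition, the multiple R-squared can be recomputed by hand from the fitted model (a sketch, assuming `lm2` from the chunk above):

```r
# Multiple R-squared = 1 - RSS/TSS
y <- learning2014$points
rss <- sum(residuals(lm2)^2)   # residual sum of squares
tss <- sum((y - mean(y))^2)    # total sum of squares
1 - rss / tss                  # matches Multiple R-squared: 0.02084
```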
5. Produce the following diagnostic plots
#Residuals vs Fitted values
plot(lm2, which=c(1))
#Normal QQ-plot
plot(lm2, which=c(2))
#Residuals vs Leverage.
plot(lm2, which=c(5))
Explain the assumptions of the model and interpret the validity of those assumptions based on the diagnostic plots.
Linear regression assumes that the errors are normally distributed with constant variance and that no single observation has undue influence on the fit. The first diagnostic plot, ‘Residuals vs Fitted’, plots the residuals against the fitted values of the response variable ‘points’. The variability of the residuals seems rather constant across fitted values, suggesting the constant-variance assumption holds. In the normal probability plot of the residuals, ‘Normal Q-Q’, the points fall mostly on a straight line, supporting the normality assumption. In the ‘Residuals vs Leverage’ plot, no observation stands out with both high leverage and a large standardized residual, so no single data point appears to have an outsized impact on the model.
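The same three diagnostics can also be drawn into a single panel grid (a convenience sketch using base R graphics):

```r
# which = 1 (Residuals vs Fitted), 2 (Normal Q-Q), 5 (Residuals vs Leverage)
par(mfrow = c(2, 2))
plot(lm2, which = c(1, 2, 5))
par(mfrow = c(1, 1))  # reset the plotting layout
```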
#install.packages("GGally")
library(tidyverse)
library(GGally)
library(ggplot2)
#alc<-read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv", sep=";", header=TRUE)
# Alternative for downloading from working directory data folder
alc<-read.csv("data/alc.csv")
#Checking variable names
names(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
These data illustrate student achievement in secondary education at two Portuguese schools. Two datasets are provided for the performance in mathematics (mat) and Portuguese language (por). The data were collected from school reports and questionnaires.
The aim of this analysis is to find associations between levels of alcohol consumption and other variables in the data set. Here we focus on the relationship of high and low alcohol consumption with age, travel time between home and school, weekly study time and number of past class failures. We hypothesize that age and past class failures are positively correlated with high alcohol consumption while study time might be negatively correlated. Travel time should not have any impact on alcohol consumption.
#Exploring the distributions of the chosen variables and their relationships with alcohol consumption: age
#Age vs high alcohol use, create cross-table for high and low alcohol use based on age
alc %>%
group_by(high_use, age) %>%
tally() %>%
spread(high_use, n)
## # A tibble: 7 × 3
## age `FALSE` `TRUE`
## <int> <int> <int>
## 1 15 63 18
## 2 16 74 28
## 3 17 62 35
## 4 18 52 25
## 5 19 7 4
## 6 20 1 NA
## 7 22 NA 1
#create linear regression model
lm_age <-lm(high_use~age,data = alc)
summary(lm_age)
##
## Call:
## lm(formula = high_use ~ age, data = alc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4405 -0.3174 -0.2764 0.6416 0.7646
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.37994 0.33479 -1.135 0.2572
## age 0.04102 0.02015 2.036 0.0425 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4569 on 368 degrees of freedom
## Multiple R-squared: 0.01114, Adjusted R-squared: 0.008452
## F-statistic: 4.145 on 1 and 368 DF, p-value: 0.04246
#Create box-plot
g1 <- ggplot(alc, aes(x = high_use, y = age))
# define the plot as a boxplot and draw it
g1 + geom_boxplot() + ylab("age")
Based on the cross table, students were 15-22 years old and most students report low alcohol consumption. Low alcohol use is relatively more common among the younger students. According to the linear regression model, age is statistically significantly associated with high alcohol consumption (p = 0.0425). The box plot illustrates that students reporting high alcohol use tend to be older.
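Because the age groups differ in size, raw counts can be hard to compare; the share of high users per age can be computed directly (a sketch using the same tidyverse verbs, assuming `alc` is loaded as above and `high_use` is logical):

```r
# mean() of a logical vector gives the proportion of TRUE values,
# here the share of high alcohol users within each age
alc %>%
  group_by(age) %>%
  summarise(n = n(), prop_high = mean(high_use))
```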
#Exploring the distributions of the chosen variables and their relationships with alcohol consumption: traveltime
#travel time vs high alcohol use, create cross-table for high and low alcohol use based on traveltime
alc %>%
group_by(high_use, traveltime) %>%
tally() %>%
spread(high_use, n)
## # A tibble: 4 × 3
## traveltime `FALSE` `TRUE`
## <int> <int> <int>
## 1 1 174 68
## 2 2 73 26
## 3 3 10 11
## 4 4 2 6
#create linear regression model
lm_traveltime <-lm(high_use~traveltime,data = alc)
summary(lm_traveltime)
##
## Call:
## lm(formula = high_use ~ traveltime, data = alc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.5323 -0.2594 -0.2594 0.6496 0.7406
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.16849 0.05428 3.104 0.00205 **
## traveltime 0.09095 0.03378 2.692 0.00742 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.455 on 368 degrees of freedom
## Multiple R-squared: 0.01931, Adjusted R-squared: 0.01665
## F-statistic: 7.247 on 1 and 368 DF, p-value: 0.007425
#Create box-plot
g2 <- ggplot(alc, aes(x = high_use, y = traveltime))
# define the plot as a boxplot and draw it
g2 + geom_boxplot() + ylab("traveltime")
Travel time is coded with numeric values from 1 to 4 as follows: 1 = <15 min, 2 = 15 to 30 min, 3 = 30 min to 1 hour, 4 = >1 hour. Most students have a commute of less than 15 minutes to school. According to the linear regression model, travel time is positively correlated with high alcohol consumption and the association is statistically significant (p = 0.0074).
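For clearer tables and plots, the coded travel times could be relabelled as an ordered factor (an illustrative sketch only; the analysis above keeps the variable numeric):

```r
# Map the codes 1-4 to their travel-time ranges
traveltime_f <- factor(alc$traveltime, levels = 1:4,
                       labels = c("<15 min", "15-30 min", "30-60 min", ">1 h"),
                       ordered = TRUE)
table(traveltime_f)
```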
#Exploring the distributions of the chosen variables and their relationships with alcohol consumption: studytime
#study time vs high alcohol use, create cross-table for high and low alcohol use based on study time
alc %>%
group_by(high_use, studytime) %>%
tally() %>%
spread(high_use, n)
## # A tibble: 4 × 3
## studytime `FALSE` `TRUE`
## <int> <int> <int>
## 1 1 56 42
## 2 2 128 57
## 3 3 52 8
## 4 4 23 4
#create linear regression model
lm_studytime <-lm(high_use~studytime,data = alc)
summary(lm_studytime)
##
## Call:
## lm(formula = high_use ~ studytime, data = alc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4211 -0.3050 -0.1889 0.5789 0.9272
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.53720 0.06094 8.815 < 2e-16 ***
## studytime -0.11609 0.02755 -4.213 3.17e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4488 on 368 degrees of freedom
## Multiple R-squared: 0.04602, Adjusted R-squared: 0.04343
## F-statistic: 17.75 on 1 and 368 DF, p-value: 3.17e-05
#Create box-plot
g3 <- ggplot(alc, aes(x = high_use, y = studytime))
# define the plot as a boxplot and draw it
g3 + geom_boxplot() + ylab("study time")
Weekly study time is coded with numeric values from 1 to 4 as follows: 1 = <2 hours, 2 = 2 to 5 hours, 3 = 5 to 10 hours, 4 = >10 hours. Most students study 2 to 5 hours per week. According to the linear regression model, study time is negatively correlated with high alcohol consumption and the association is statistically significant (p < 0.001).
#Exploring the distributions of the chosen variables and their relationships with alcohol consumption: failures
#failures vs high alcohol use, create cross-table for high and low alcohol use based on failures
alc %>%
group_by(high_use, failures) %>%
tally() %>%
spread(high_use, n)
## # A tibble: 4 × 3
## failures `FALSE` `TRUE`
## <int> <int> <int>
## 1 0 238 87
## 2 1 12 12
## 3 2 8 9
## 4 3 1 3
#create linear regression model
lm_failures <-lm(high_use~failures,data = alc)
summary(lm_failures)
##
## Call:
## lm(formula = high_use ~ failures, data = alc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7409 -0.2703 -0.2703 0.5728 0.7297
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.27033 0.02477 10.915 < 2e-16 ***
## failures 0.15685 0.04211 3.725 0.000226 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4511 on 368 degrees of freedom
## Multiple R-squared: 0.03634, Adjusted R-squared: 0.03372
## F-statistic: 13.88 on 1 and 368 DF, p-value: 0.0002259
#Create box-plot
g4 <- ggplot(alc, aes(x = high_use, y = failures))
# define the plot as a boxplot and draw it
g4 + geom_boxplot() + ylab("failures")
Most students have not failed a class. According to the linear regression model, class failures are positively correlated with high alcohol consumption and the association is statistically significant (p < 0.001). The box plot is not very informative here, since the vast majority of students have zero failures.
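Since `high_use` is binary, logistic regression would be a natural alternative to the linear probability models fitted above; a sketch, not part of the original analysis, assuming `alc` is loaded as before:

```r
# Model the probability of high alcohol use with all four candidate predictors
glm_alc <- glm(high_use ~ age + traveltime + studytime + failures,
               data = alc, family = binomial)
summary(glm_alc)
exp(coef(glm_alc))  # coefficients expressed as odds ratios
```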
#Load necessary packages (install first if needed)
library(tidyverse)
library(GGally)
library(ggplot2)
#install.packages(c("MASS", "corrplot"))
This week’s assignment utilizes the Boston housing data (housing values in suburbs of Boston) to explore how the explanatory variables are associated with crime rates in different suburbs. The Boston data set has 506 rows of observations, each representing a suburb, and 14 columns containing the following variables:
- crim: per capita crime rate by town
- zn: proportion of residential land zoned for lots over 25,000 sq.ft.
- indus: proportion of non-retail business acres per town
- chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- nox: nitrogen oxides concentration (parts per 10 million)
- rm: average number of rooms per dwelling
- age: proportion of owner-occupied units built prior to 1940
- dis: weighted mean of distances to five Boston employment centres
- rad: index of accessibility to radial highways
- tax: full-value property-tax rate per 10,000 USD
- ptratio: pupil-teacher ratio by town
- black: proportion of blacks by town
- lstat: lower status of the population (percent)
- medv: median value of owner-occupied homes in 1000s of USD
#Download the Boston data
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# load the data
data("Boston")
dim(Boston)
## [1] 506 14
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
The key statistical properties of the variables are presented in the summary table above. Per capita crime rates vary between 0.00632 and 88.97620, with a mean of 3.61352 and a median of 0.25651. The median being far below the mean suggests that a few suburbs with very high crime rates pull the overall average up. The proportion of non-retail business acres (indus) ranges between 0.46 % and 27.74 % (mean 11.14 %, median 9.69 %). On average 68.57 % (median 77.50 %) of the owner-occupied houses were built before the 1940s. The median value of owner-occupied homes in the suburbs/towns ranges between 5.00 and 50.00 thousand USD (mean 22.53, median 21.20).
The correlation matrix below shows the pairwise associations of the variables in the data set. The size and color intensity of each dot denote the strength of the association, with red indicating negative and blue positive correlation. Looking at the crim column, we can see that the variables indus, nox, age, rad, tax, ptratio, and lstat are associated with higher crime rates. Interpreting these results, we can postulate that areas with higher industrial land use, closer access to major highways and, accordingly, higher nitrogen oxide concentrations in the air are more prone to criminality. The variables associated with lower crime rates are zn, rm, dis, black, and medv. Put together, the data suggest that areas zoned for houses on larger lots, with more rooms per dwelling, longer distances to major employment centers, a higher proportion of blacks in the town and more expensive housing had lower crime rates.
The numeric correlation matrix shows that rad (index of accessibility to radial highways) has the strongest positive association with criminality, and median house price the strongest negative association. Other inferences can be read from the table as well, such as the positive correlation between indus and rad (0.60), suggesting that industrial areas have better access to radial highways, which makes logistical sense.
library(tidyr); library(dplyr); library(ggplot2)
library(ggplot2)
library(GGally)
#Examining the correlation of the variables
cor_matrix <- cor(Boston)
round(cor_matrix, digits=2)
## crim zn indus chas nox rm age dis rad tax ptratio
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44 -0.18
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51
## black lstat medv
## crim -0.39 0.46 -0.39
## zn 0.18 -0.41 0.36
## indus -0.36 0.60 -0.48
## chas 0.05 -0.05 0.18
## nox -0.38 0.59 -0.43
## rm 0.13 -0.61 0.70
## age -0.27 0.60 -0.38
## dis 0.29 -0.50 0.25
## rad -0.44 0.49 -0.38
## tax -0.44 0.54 -0.47
## ptratio -0.18 0.37 -0.51
## black 1.00 -0.37 0.33
## lstat -0.37 1.00 -0.74
## medv 0.33 -0.74 1.00
library(corrplot)
## corrplot 0.92 loaded
corrplot(cor_matrix, method="circle")
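The correlations with crime rate discussed above can also be pulled out of the matrix and ranked directly (a small convenience sketch, assuming `cor_matrix` from the chunk above):

```r
# Rank all variables by their correlation with per capita crime rate
sort(cor_matrix[, "crim"], decreasing = TRUE)
```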
#Plotting the distributions of the variables
ggpairs(Boston[1:7], mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))
#Histograms for plotting the distribution of the variables
par(mfrow=c(3,3)) #arrange the histograms in a 3x3 grid
hist(Boston$crim, main = paste("crim"), xlab = NULL)
hist(Boston$zn, main = paste("zn"), xlab = NULL)
hist(Boston$indus, main = paste("indus"), xlab = NULL)
hist(Boston$chas, main = paste("chas"), xlab = NULL)
hist(Boston$nox, main = paste("nox"), xlab = NULL)
hist(Boston$rm, main = paste("rm"), xlab = NULL)
hist(Boston$age, main = paste("age"), xlab = NULL)
hist(Boston$dis, main = paste("dis"), xlab = NULL)
Standardizing the data changes the scale of the columns: the value ranges are now much smaller, and each column has mean 0 (and standard deviation 1). Next we create a categorical crime-rate variable using the quantiles as break points and replace the old crim variable with the new categorical variable. The data set is then split into training and test sets.
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
# Save as data frame
boston_scaled<-as.data.frame(boston_scaled)
summary(boston_scaled$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
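`scale()` subtracts the column mean and divides by the column standard deviation; this can be verified by hand for a single column (a sketch, assuming `Boston` is loaded as above):

```r
# Manual standardization of crim should match scale()
crim_std <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(crim_std, as.numeric(scale(Boston$crim)))  # should be TRUE
```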
#Setting quantiles as breaks for the categorical crime variable
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE)
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
#Creating training and test sets
library(dplyr)
library(MASS)
boston_scaled <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/boston_scaled.txt",
sep=",", header = T)
boston_scaled$crime <- factor(boston_scaled$crime, levels = c("low", "med_low", "med_high", "high"))
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
#Fitting the linear discriminant model
lda.fit <- lda(crime~., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2524752 0.2400990 0.2400990 0.2673267
##
## Group means:
## zn indus chas nox rm age
## low 0.92980714 -0.9258533 -0.07933396 -0.8988515 0.4699139 -0.8800690
## med_low -0.06532032 -0.3100251 0.01179157 -0.5768188 -0.1052403 -0.3849437
## med_high -0.38822494 0.2422142 0.17414622 0.3976446 0.1591027 0.4564593
## high -0.48724019 1.0169921 -0.05360128 1.0397246 -0.3819122 0.7915953
## dis rad tax ptratio black lstat
## low 0.8842760 -0.7082658 -0.7130674 -0.5065763 0.38009341 -0.764375911
## med_low 0.3456069 -0.5473481 -0.4455566 -0.0508894 0.34017665 -0.188797965
## med_high -0.3775725 -0.3993498 -0.2742829 -0.2856516 0.07052586 -0.007806074
## high -0.8340967 1.6393984 1.5149640 0.7822555 -0.70836257 0.892334358
## medv
## low 0.53869659
## med_low 0.02613753
## med_high 0.19248303
## high -0.69329702
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.115560002 0.549153405 -0.9552562
## indus 0.007920383 -0.426190462 0.2007054
## chas -0.107404810 0.011727132 0.1645388
## nox 0.475231432 -0.853000399 -1.1939660
## rm -0.097149878 -0.155037427 -0.2261259
## age 0.279960711 -0.397045777 -0.3288585
## dis -0.032528220 -0.297542279 -0.0796108
## rad 3.233128821 0.787526851 -0.2526551
## tax -0.034937381 0.253683184 0.5652062
## ptratio 0.130675659 0.007615114 -0.1576227
## black -0.127959932 0.020660774 0.1234552
## lstat 0.195298309 -0.091849318 0.4259614
## medv 0.212021308 -0.366378965 -0.1174221
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9532 0.0363 0.0105
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 2, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(factor(train$crime))
# plot the lda results
plot(lda.fit, dimen = 2,col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)
The linear discriminant plot above shows the cluster of towns with a high crime rate on the right side of the graph and the other crime-rate categories (low, med_low, med_high) on the left. Arrows representing the effect of each explanatory variable are drawn on the graph, with rad having by far the longest arrow, pointing towards the high-crime cluster. This means that, for any given suburb, easier access to the radial highways of greater Boston is associated with higher crime rates.
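The natural next step, suggested by the `test`/`correct_classes` split prepared above, is to predict the crime categories of the held-out observations (a sketch using the MASS `predict` method for lda objects):

```r
# Predict crime categories for the test set
lda.pred <- predict(lda.fit, newdata = test)
# Cross-tabulate predictions against the true classes
table(correct = correct_classes, predicted = lda.pred$class)
```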
#Reloading the Boston data and standardizing it
library(MASS)
data("Boston")
boston_scaled2<-as.data.frame(scale(Boston))
#Calculate distances
dist_eu <- dist(boston_scaled2)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
#Calculating WCSS to determine optimal number of clusters
set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
#K-means
km <- kmeans(boston_scaled2, centers = 4)
pairs(boston_scaled2[, 1:5], col = km$cluster)
#Repeating with two clusters
km <- kmeans(boston_scaled2, centers = 2)
pairs(boston_scaled2[, 1:5], col = km$cluster)
Based on the total within-cluster sum of squares (WCSS), the optimal number of clusters is two: this is the point at which the total WCSS drops most sharply, i.e. the elbow of the curve. For the sake of comparison, the k-means clustering is drawn with both four and two clusters. For readability, only the first five variables are included in these pairwise cluster plots.
# Drawing a scatter plot to see if older houses are less valuable than new ones
library(tidyverse)
clusters <- factor(km$cluster)
boston_scaled2 %>% ggplot(aes(x = age, y = medv, col = clusters)) +
geom_point()
The scatter plot shows an example of the clustering in terms of the proportion of owner-occupied units built prior to 1940 (age) and the median value of owner-occupied homes in 1000s of USD (medv). The red cluster 1 in the lower right corner captures the cheaper, older housing properties. The second, green cluster is more evenly distributed.
library(readr)
human<-read_csv("data/human.csv")
## Rows: 155 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (8): Edu2.FM, Labo.FM, Edu.Exp, Life.Exp, GNI, Mat.Mor, Ado.Birth, Parli.F
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
names(human)
## [1] "Edu2.FM" "Labo.FM" "Edu.Exp" "Life.Exp" "GNI" "Mat.Mor"
## [7] "Ado.Birth" "Parli.F"
str(human)
## spc_tbl_ [155 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Edu2.FM : num [1:155] 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num [1:155] 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu.Exp : num [1:155] 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.Exp : num [1:155] 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : num [1:155] 64992 42261 56431 44025 45435 ...
## $ Mat.Mor : num [1:155] 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num [1:155] 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num [1:155] 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
## - attr(*, "spec")=
## .. cols(
## .. Edu2.FM = col_double(),
## .. Labo.FM = col_double(),
## .. Edu.Exp = col_double(),
## .. Life.Exp = col_double(),
## .. GNI = col_double(),
## .. Mat.Mor = col_double(),
## .. Ado.Birth = col_double(),
## .. Parli.F = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
The Human data is a composite of two joined data sets: the Human Development Index (HDI) data and Gender Inequality Index (GII) data. The HDI and GII data originate from the United Nations Development Programme (UNDP).
The UNDP created the HDI to emphasize that people and their capabilities should be the ultimate criteria for assessing the development of a country, not economic growth alone. The HDI takes into account a nation’s longevity, education and income. The health dimension is assessed by life expectancy at birth; the education dimension is measured by mean years of schooling for adults aged 25 years and older and expected years of schooling for children of school-entering age. The standard of living dimension is measured by gross national income per capita.
GII is a composite metric of gender inequality using three dimensions: reproductive health, empowerment and the labour market. The indicators of the reproductive health dimension are the maternal mortality ratio and the adolescent birth rate. The indicators of the empowerment dimension are the proportions of women and men with at least secondary education and the shares of parliamentary seats. The indicator of the labour market dimension is the labour force participation rate for women and men.
#Histogram
par(mfrow=c(3,3))
hist(human$Edu2.FM, col = 1, main = "Women's education ratio", xlab = "F/M ratio with at least secondary education", ylab = "n countries")
hist(human$Labo.FM, col = 2, main = "Labour force participation ratio", xlab = "F/M ratio in the labour force", ylab = "n countries")
hist(human$Edu.Exp, col = 3, main = "Expected years of schooling", xlab = "years", ylab = "n countries")
hist(human$Life.Exp, col = 4, main = "Life expectancy at birth", xlab = "years", ylab = "n countries")
hist(human$GNI, col = 5, main = "GNI per capita", xlab = "US dollars", ylab = "n countries")
hist(human$Mat.Mor, col = 6, main = "Maternal mortality", xlab = "maternal deaths per 100,000 live births", ylab = "n countries")
hist(human$Ado.Birth, col = 7, main = "Adolescent birth rate", xlab = "births per 1,000 women aged 15-19", ylab = "n countries")
hist(human$Parli.F, col = 8, main = "Women in parliament", xlab = "share of parliamentary seats (%)", ylab = "n countries")
library(tidyr); library(dplyr); library(ggplot2)
library(GGally)
library(corrplot)
library(readr)
#Download data
human<-read_csv("data/human.csv")
## Rows: 155 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (8): Edu2.FM, Labo.FM, Edu.Exp, Life.Exp, GNI, Mat.Mor, Ado.Birth, Parli.F
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#Take a summary of the data
summary(human)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
# A matrix of scatterplots (note: human.csv has no country column here, so [-1] drops Edu2.FM)
pairs(human[-1])
cor_matrix <- cor(human[-1])
round(cor_matrix, digits = 2)
## Labo.FM Edu.Exp Life.Exp GNI Mat.Mor Ado.Birth Parli.F
## Labo.FM 1.00 0.05 -0.14 -0.02 0.24 0.12 0.25
## Edu.Exp 0.05 1.00 0.79 0.62 -0.74 -0.70 0.21
## Life.Exp -0.14 0.79 1.00 0.63 -0.86 -0.73 0.17
## GNI -0.02 0.62 0.63 1.00 -0.50 -0.56 0.09
## Mat.Mor 0.24 -0.74 -0.86 -0.50 1.00 0.76 -0.09
## Ado.Birth 0.12 -0.70 -0.73 -0.56 0.76 1.00 -0.07
## Parli.F 0.25 0.21 0.17 0.09 -0.09 -0.07 1.00
corrplot(cor_matrix,method="square", type="lower",cl.pos="b", tl.pos="lt",tl.cex=0.8)
Principal component analysis (PCA) is a technique for analyzing large datasets containing a high number of dimensions per observation. PCA can increase the interpretability of the data while preserving the maximum amount of information, and it enables the visualization of multidimensional data.
First we attempt a PCA on the Human data without scaling it first. This clusters the countries in the upper right corner with a lot of overlap, so we cannot discern how the dimensions affect our observations. Only countries representing some extremes in the data are seen separate from the main cluster (e.g. Qatar, Kuwait, Sierra Leone).
library(readr)
human <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/human2.txt", sep =",", header = TRUE)
pca_human <- prcomp(human)
biplot(pca_human, choices = 1:2, cex=c(0.5,0.8), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
After standardizing the data we can see that the countries are more spread out along the axes of the two principal components. The arrows represent the variables of the data. After standardization, the two principal components illustrate the variation in the data much better. We can see countries with high standards of living, long life expectancy and higher education clustered to the left, and countries with lower standards of living, shorter life expectancy and lower education to the right of the graph. The wealthy countries are further divided along the vertical axis based on gender inequality, with many Nordic and Western countries in the upper left quadrant representing better gender equality. Wealthy countries with poor gender equality, e.g. the Gulf countries, are clustered toward the lower left quadrant. Afghanistan and Niger in the lower right corner of the PCA plot represent countries with low income and high gender inequality.
In a nutshell, we can view PC1 as the ‘wealth and health’ axis and PC2 as the ‘gender equality’ axis.
# standardize the variables
human <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/human2.txt", sep =",", header = TRUE)
human_std <- scale(human)
# print out summaries of the standardized variables
summary(human_std)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7378 Min. :-2.7188
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6782 1st Qu.:-0.6425
## Median : 0.3503 Median : 0.2316 Median : 0.1140 Median : 0.3056
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.7126 3rd Qu.: 0.6717
## Max. : 2.6646 Max. : 1.6632 Max. : 2.4730 Max. : 1.4218
## GNI Mat.Mor Ado.Birth Parli.F
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human_std)
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex=c(0.5,0.8), col = c("grey40", "deeppink2"))
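To quantify how much of the total variation the first two components capture (a sketch, assuming the standardized `pca_human` object fitted above), the proportions of variance can be read from the `prcomp` summary:

```r
# proportion of variance (%) explained by PC1 and PC2
s <- summary(pca_human)
round(100 * s$importance["Proportion of Variance", 1:2], 1)
```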
library(dplyr)
library(tidyr)
library(readr)
library(ggplot2)
tea <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv",
sep = ",", header = T)
tea$Tea <- factor(tea$Tea)
tea$How <- factor(tea$How)
tea$how <- factor(tea$how)
tea$sugar <- factor(tea$sugar)
tea$where <- factor(tea$where)
tea$lunch <- factor(tea$lunch)
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : chr "breakfast" "breakfast" "Not.breakfast" "Not.breakfast" ...
## $ tea.time : chr "Not.tea time" "Not.tea time" "tea time" "Not.tea time" ...
## $ evening : chr "Not.evening" "Not.evening" "evening" "Not.evening" ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : chr "Not.dinner" "Not.dinner" "dinner" "dinner" ...
## $ always : chr "Not.always" "Not.always" "Not.always" "Not.always" ...
## $ home : chr "home" "home" "home" "home" ...
## $ work : chr "Not.work" "Not.work" "work" "Not.work" ...
## $ tearoom : chr "Not.tearoom" "Not.tearoom" "Not.tearoom" "Not.tearoom" ...
## $ friends : chr "Not.friends" "Not.friends" "friends" "Not.friends" ...
## $ resto : chr "Not.resto" "Not.resto" "resto" "Not.resto" ...
## $ pub : chr "Not.pub" "Not.pub" "Not.pub" "Not.pub" ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : chr "p_unknown" "p_variable" "p_variable" "p_variable" ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : chr "M" "F" "F" "M" ...
## $ SPC : chr "middle" "middle" "other worker" "student" ...
## $ Sport : chr "sportsman" "sportsman" "sportsman" "Not.sportsman" ...
## $ age_Q : chr "35-44" "45-59" "45-59" "15-24" ...
## $ frequency : chr "1/day" "1/day" "+2/day" "1/day" ...
## $ escape.exoticism: chr "Not.escape-exoticism" "escape-exoticism" "Not.escape-exoticism" "escape-exoticism" ...
## $ spirituality : chr "Not.spirituality" "Not.spirituality" "Not.spirituality" "spirituality" ...
## $ healthy : chr "healthy" "healthy" "healthy" "healthy" ...
## $ diuretic : chr "Not.diuretic" "diuretic" "diuretic" "Not.diuretic" ...
## $ friendliness : chr "Not.friendliness" "Not.friendliness" "friendliness" "Not.friendliness" ...
## $ iron.absorption : chr "Not.iron absorption" "Not.iron absorption" "Not.iron absorption" "Not.iron absorption" ...
## $ feminine : chr "Not.feminine" "Not.feminine" "Not.feminine" "Not.feminine" ...
## $ sophisticated : chr "Not.sophisticated" "Not.sophisticated" "Not.sophisticated" "sophisticated" ...
## $ slimming : chr "No.slimming" "No.slimming" "No.slimming" "No.slimming" ...
## $ exciting : chr "No.exciting" "exciting" "No.exciting" "No.exciting" ...
## $ relaxing : chr "No.relaxing" "No.relaxing" "relaxing" "relaxing" ...
## $ effect.on.health: chr "No.effect on health" "No.effect on health" "No.effect on health" "No.effect on health" ...
For visualizing the data we choose six interesting variables (how, How, pub, sugar, Tea and where):
library(ggplot2)
pivot_longer(tea, cols = 12:17) %>%
ggplot(aes(value)) + geom_bar() + facet_wrap("name", scales = "free") + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
Multiple correspondence analysis (MCA) is a method for analyzing qualitative data, used to detect patterns or structure in the data as well as for dimension reduction. The two axes of the MCA plot represent the dimensions that account for the most variance; the first dimension (Dim1) is the most important in terms of the amount of variance accounted for.
In the Tea data we can observe the most common characteristics clustered around the center of the graph (Dim1 = 0, Dim2 = 0). The common tea consumer habit is Earl Grey bought from a chain store, drunk as is or with milk. The more distinguishing features of tea consumption are spread out from the center of the MCA factor map. Looking at the lower right corner we can identify a cluster of variables depicting a true tea devotee who buys their loose-leaf green tea from a designated tea shop.
library(dplyr)
library(tidyr)
library(FactoMineR)
tea_time<-tea[12:17]
mca <- MCA(tea_time, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.281 0.267 0.222 0.186 0.179 0.154 0.144
## % of var. 15.325 14.547 12.130 10.166 9.773 8.390 7.844
## Cumulative % of var. 15.325 29.872 42.002 52.168 61.941 70.331 78.176
## Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.136 0.115 0.087 0.062
## % of var. 7.422 6.259 4.751 3.391
## Cumulative % of var. 85.598 91.857 96.609 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr cos2
## 1 | -0.382 0.173 0.140 | -0.266 0.088 0.068 | 0.223 0.074 0.048
## 2 | -0.330 0.129 0.070 | -0.202 0.051 0.026 | 0.758 0.862 0.369
## 3 | -0.435 0.224 0.313 | -0.210 0.055 0.073 | 0.108 0.018 0.019
## 4 | -0.578 0.396 0.534 | -0.159 0.032 0.040 | -0.314 0.148 0.157
## 5 | -0.435 0.224 0.313 | -0.210 0.055 0.073 | 0.108 0.018 0.019
## 6 | -0.435 0.224 0.313 | -0.210 0.055 0.073 | 0.108 0.018 0.019
## 7 | -0.435 0.224 0.313 | -0.210 0.055 0.073 | 0.108 0.018 0.019
## 8 | -0.330 0.129 0.070 | -0.202 0.051 0.026 | 0.758 0.862 0.369
## 9 | 0.256 0.078 0.037 | 0.683 0.584 0.265 | 0.362 0.197 0.075
## 10 | 0.544 0.351 0.181 | 0.461 0.266 0.130 | 0.785 0.923 0.376
##
## 1 |
## 2 |
## 3 |
## 4 |
## 5 |
## 6 |
## 7 |
## 8 |
## 9 |
## 10 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2 v.test
## Not.pub | -0.102 0.486 0.039 -3.414 | -0.191 1.801 0.137 -6.406 |
## pub | 0.383 1.827 0.039 3.414 | 0.719 6.776 0.137 6.406 |
## black | 0.426 2.658 0.059 4.217 | -0.093 0.134 0.003 -0.922 |
## Earl Grey | -0.198 1.501 0.071 -4.605 | 0.238 2.270 0.102 5.519 |
## green | 0.204 0.272 0.005 1.240 | -1.181 9.585 0.172 -7.178 |
## alone | -0.059 0.136 0.007 -1.397 | -0.218 1.923 0.088 -5.127 |
## lemon | 0.841 4.610 0.087 5.110 | 0.518 1.843 0.033 3.147 |
## milk | -0.351 1.531 0.033 -3.126 | 0.139 0.253 0.005 1.239 |
## other | 0.657 0.768 0.013 1.997 | 1.843 6.367 0.105 5.604 |
## No.sugar | 0.221 1.491 0.052 3.943 | -0.076 0.189 0.006 -1.366 |
## Dim.3 ctr cos2 v.test
## Not.pub 0.181 1.932 0.123 6.059 |
## pub -0.680 7.269 0.123 -6.059 |
## black 1.055 20.586 0.365 10.442 |
## Earl Grey -0.462 10.303 0.385 -10.735 |
## green 0.337 0.937 0.014 2.050 |
## alone 0.021 0.021 0.001 0.484 |
## lemon -1.325 14.468 0.217 -8.053 |
## milk 0.342 1.845 0.031 3.053 |
## other 2.015 9.131 0.126 6.128 |
## No.sugar 0.577 12.894 0.356 10.317 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## pub | 0.039 0.137 0.123 |
## Tea | 0.075 0.192 0.425 |
## How | 0.119 0.166 0.340 |
## sugar | 0.052 0.006 0.356 |
## how | 0.686 0.445 0.040 |
## where | 0.716 0.653 0.051 |
# visualize MCA
plot(mca, invisible=c("ind"), graph.type = "classic", habillage = "quali")
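To inspect how much variance each MCA dimension retains (a sketch, assuming the `mca` object fitted above), the eigenvalues and their percentages are stored directly in the FactoMineR model object:

```r
# eigenvalue, percentage of variance and cumulative percentage per dimension
head(mca$eig)
```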
Countless cups of tea were consumed during the completion of this assignment.
This week we delve into different techniques to analyze longitudinal data where observations are collected on several different occasions over a period of time. First we look into ways to visualize such data.
In this part of the assignment we use data measuring weight gain in rats on different diets over a 9-week period. The rats were divided into three groups.
Let’s visualize the weight gain responses of the three groups of our rodents.
# Get RATS data and convert into log form RATSL
library(dplyr); library(tidyr); library(ggplot2)
RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", header = TRUE, sep = '\t')
RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)
RATSL <- pivot_longer(RATS, cols=-c(ID,Group), names_to = "WD",values_to = "Weight") %>% mutate(Time = as.integer(substr(WD,3,4))) %>% arrange(Time)
RATSL$ID <- factor(RATSL$ID)
RATSL$Group <- factor(RATSL$Group)
#Looking at the data
glimpse(RATS)
## Rows: 16
## Columns: 13
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ WD1 <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470,…
## $ WD8 <int> 250, 230, 250, 255, 260, 265, 275, 255, 415, 420, 445, 560, 465,…
## $ WD15 <int> 255, 230, 250, 255, 255, 270, 260, 260, 425, 430, 450, 565, 475,…
## $ WD22 <int> 260, 232, 255, 265, 270, 275, 270, 268, 428, 440, 452, 580, 485,…
## $ WD29 <int> 262, 240, 262, 265, 270, 275, 273, 270, 438, 448, 455, 590, 487,…
## $ WD36 <int> 258, 240, 265, 268, 273, 277, 274, 265, 443, 460, 455, 597, 493,…
## $ WD43 <int> 266, 243, 267, 270, 274, 278, 276, 265, 442, 458, 451, 595, 493,…
## $ WD44 <int> 266, 244, 267, 272, 273, 278, 271, 267, 446, 464, 450, 595, 504,…
## $ WD50 <int> 265, 238, 264, 274, 276, 284, 282, 273, 456, 475, 462, 612, 507,…
## $ WD57 <int> 272, 247, 268, 273, 278, 279, 281, 274, 468, 484, 466, 618, 518,…
## $ WD64 <int> 278, 245, 269, 275, 280, 281, 284, 278, 478, 496, 472, 628, 525,…
summary(RATS)
## ID Group WD1 WD8 WD15
## 1 : 1 1:8 Min. :225.0 Min. :230.0 Min. :230.0
## 2 : 1 2:4 1st Qu.:252.5 1st Qu.:255.0 1st Qu.:255.0
## 3 : 1 3:4 Median :340.0 Median :345.0 Median :347.5
## 4 : 1 Mean :365.9 Mean :369.1 Mean :372.5
## 5 : 1 3rd Qu.:480.0 3rd Qu.:476.2 3rd Qu.:486.2
## 6 : 1 Max. :555.0 Max. :560.0 Max. :565.0
## (Other):10
## WD22 WD29 WD36 WD43
## Min. :232.0 Min. :240.0 Min. :240.0 Min. :243.0
## 1st Qu.:267.2 1st Qu.:268.8 1st Qu.:267.2 1st Qu.:269.2
## Median :351.5 Median :356.5 Median :360.0 Median :360.0
## Mean :379.2 Mean :383.9 Mean :387.0 Mean :386.0
## 3rd Qu.:492.5 3rd Qu.:497.8 3rd Qu.:504.2 3rd Qu.:501.0
## Max. :580.0 Max. :590.0 Max. :597.0 Max. :595.0
##
## WD44 WD50 WD57 WD64
## Min. :244.0 Min. :238.0 Min. :247.0 Min. :245.0
## 1st Qu.:270.0 1st Qu.:273.8 1st Qu.:273.8 1st Qu.:278.0
## Median :362.0 Median :370.0 Median :373.5 Median :378.0
## Mean :388.3 Mean :394.6 Mean :398.6 Mean :404.1
## 3rd Qu.:510.5 3rd Qu.:516.0 3rd Qu.:524.5 3rd Qu.:530.8
## Max. :595.0 Max. :612.0 Max. :618.0 Max. :628.0
##
names(RATS)
## [1] "ID" "Group" "WD1" "WD8" "WD15" "WD22" "WD29" "WD36" "WD43"
## [10] "WD44" "WD50" "WD57" "WD64"
# Individual response profiles by group for the RATS data.
names(RATSL)
## [1] "ID" "Group" "WD" "Weight" "Time"
ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:16, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))
The graph shows that some weight gain was observed in all groups of rats. The lines representing the weight of a single rat are closer together in group 1, while the weights in groups 2 and 3 have a wider range. Due to the between-group differences the scale is wide, making smaller changes harder to discern from this visualization.
Next let’s see how standardizing the Rats data changes the response profile of the rats’ weights.
# Same plots after standardizing the data
RATSL <- RATSL %>%
group_by(Group) %>%
mutate(stdWeights = (Weight - mean(Weight))/sd(Weight) ) %>%
ungroup()
#summary(RATSL)
library(ggplot2)
ggplot(RATSL, aes(x = Time, y = stdWeights, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:16, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
scale_y_continuous(name = "standardized Weight")
Now it is easier to see how the rats’ weights developed over time. A tracking phenomenon — the tendency of rats with higher weight at baseline to have higher weight throughout the study — is more discernible in the standardized plot.
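One simple way to quantify the tracking phenomenon (a sketch, assuming the wide-format `RATS` data read in above) is to correlate the baseline weight with the weight at the last measurement day:

```r
# a high positive correlation supports the tracking interpretation:
# rats heavier at baseline (WD1) tend to remain heavier at day 64 (WD64)
cor(RATS$WD1, RATS$WD64)
```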
library(dplyr)
library(tidyr)
str(RATSL)
## tibble [176 × 6] (S3: tbl_df/tbl/data.frame)
## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
## $ WD : chr [1:176] "WD1" "WD1" "WD1" "WD1" ...
## $ Weight : int [1:176] 240 225 245 260 255 260 275 245 410 405 ...
## $ Time : int [1:176] 1 1 1 1 1 1 1 1 1 1 ...
## $ stdWeights: num [1:176] -1.733 -2.829 -1.367 -0.271 -0.637 ...
# Summary data with group means and standard errors (note: sqrt(4) assumes 4 rats per group, but group 1 has 8; sqrt(n()) would be exact)
RATSS <- RATSL %>%
group_by(Group,Time) %>%
summarise( mean = mean(Weight), se = sd(Weight)/sqrt(4) ) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
# Glimpse the data
glimpse(RATSS)
## Rows: 33
## Columns: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ Time <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29, 36, …
## $ mean <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.375, 2…
## $ se <dbl> 7.610789, 6.546537, 5.737953, 6.800407, 5.528740, 5.891883, 5.47…
# Plot the mean profiles
library(ggplot2)
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
geom_line() +
scale_linetype_manual(values = c(1,2,3)) +
geom_point(size=3) +
scale_shape_manual(values = c(1,2,3)) +
geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
scale_y_continuous(name = "mean")
The plot shows a gradual weight gain in all groups of rats over the course of the 64-day follow-up.
library(dplyr)
library(tidyr)
# Create summary data by group and subject with the mean weight as the summary variable
RATS_Sum <- RATSL %>%
group_by(Group, ID) %>%
summarise( mean=mean(Weight) ) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
# Glimpse the data
glimpse(RATS_Sum )
## Rows: 16
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean <dbl> 261.0909, 237.6364, 260.1818, 266.5455, 269.4545, 274.7273, 274.…
# Draw a boxplot of the mean versus treatment
library(ggplot2)
ggplot(RATS_Sum, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(Weight)")
Here we have drawn boxplots of the summary measure (mean weight) for the three groups. We can see that the mean weights of the rats vary considerably between groups, with little overlap. The first group has the lightest rats, while the third group has the heaviest. The group 2 mean has the widest distribution due to one outlier rat. Next, let’s see how removing the fat rat from group 2 changes the boxplot of summary means.
# filtering the outlier from Group 2
RATS_Sum <- filter(RATS_Sum, mean<550)
library(ggplot2)
ggplot(RATS_Sum, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(Weight)")
The summary measure varies less in group 2 after removing the outlier, making the boxplots more equal in width across groups. The relative order of the ascending mean weights from group 1 to 3 remains. To confirm that the visually discernible difference is real, let’s perform an analysis of variance (ANOVA) as a formal test for a difference. In the ANOVA output below the p-value (3.387e-12) indicates a statistically significant difference in the means between the groups. The F value (F statistic) is far from 1, indicating that the variation between the group means is large relative to the variation within groups.
library(dplyr)
library(tidyr)
# Fit the linear model with the mean as the response
fit <- lm(mean ~Group, data = RATS_Sum)
# Compute the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 2 206390 103195 483.6 3.387e-12 ***
## Residuals 12 2561 213
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
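The ANOVA only tells us that at least one group mean differs. As an optional follow-up (a sketch using base R's `aov` and `TukeyHSD` on the same summary data), pairwise comparisons would show which specific groups differ:

```r
# Tukey's honest significant differences for all pairwise group comparisons
TukeyHSD(aov(mean ~ Group, data = RATS_Sum))
```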
This part of the assignment uses the BPRS data, in which 40 male subjects were randomly assigned to one of two treatment groups and each subject was rated on the brief psychiatric rating scale (BPRS) at baseline (week 0) and then at weekly intervals for eight weeks. The BPRS assesses the level of 18 symptoms, each rated from 1 (not present) to 7 (extremely severe). The scale is used to evaluate patients suspected of having schizophrenia.
First we will plot our data showing the change of bprs scores over time. The slightly overlapping confidence interval shading makes the graph a bit hazy but we can observe a slight decline in the bprs values over time.
library(dplyr); library(tidyr)
BPRS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", sep =" ", header = T)
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)
BPRSL <- pivot_longer(BPRS, cols=-c(treatment,subject),names_to = "weeks",values_to = "bprs") %>% arrange(weeks)
BPRSL <- BPRSL %>% mutate(week = as.integer(substr(weeks,5,5)))
rm(BPRS)
#str(BPRSL)
library(ggplot2)
ggplot(BPRSL, aes(x = week, y = bprs, group = subject)) +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Ignoring the fact that the repeated measurements were made on the same individuals, we first analyse the data with multiple linear regression. Looking at the summary of the regression model on the BPRS data, where the Brief Psychiatric Rating Scale score is the response variable and week and treatment are the explanatory variables, we can see that the variable week is negatively associated with the BPRS score (t value -8.995) and the association is statistically significant (p < 2e-16). It appears that over time the patients get better regardless of the intervention. Treatment 2 showed no statistically significant effect on the BPRS score. However, the model makes the unlikely assumption that the repeated measures of bprs scores are independent.
BPRS_reg <- lm(bprs ~ week + treatment, data=BPRSL)
# print out a summary of the model
summary(BPRS_reg)
##
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
Because the observations of the bprs scores are not independent of one another, we need additional models. First we create a random intercept model, where (1 | subject) denotes the random-effects term. The summary of the random intercept model (BPRS_ref) shows similar findings to the previous multiple linear model: time appears to be negatively associated with the bprs score. Comparing the standard error of time (week) in the two models, 0.2524 for the linear model and 0.2084 for the random intercept model, the former is slightly larger. This is because assuming independence of a within-subject covariate leads to a larger standard error.
library(dplyr); library(tidyr)
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
BPRS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", sep =" ", header = T)
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)
BPRSL <- pivot_longer(BPRS, cols=-c(treatment,subject),names_to = "weeks",values_to = "bprs") %>% arrange(weeks)
BPRSL <- BPRSL %>% mutate(week = as.integer(substr(weeks,5,5)))
#create a random intercept model
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2748.7 2768.1 -1369.4 2738.7 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0481 -0.6749 -0.1361 0.4813 3.4855
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.41 6.885
## Residual 104.21 10.208
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 1.9090 24.334
## week -2.2704 0.2084 -10.896
## treatment2 0.5722 1.0761 0.532
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.437
## treatment2 -0.282 0.000
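From the variance components above one can compute the intraclass correlation (ICC), the share of total variance attributable to between-subject differences (a sketch, assuming the fitted `BPRS_ref` object and the lme4 `VarCorr` accessor):

```r
# ICC = subject-intercept variance / (subject-intercept variance + residual variance)
# from the summary above: 47.41 / (47.41 + 104.21), roughly 0.31
vc <- as.data.frame(VarCorr(BPRS_ref))
vc$vcov[vc$grp == "subject"] / sum(vc$vcov)
```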
Here we fit a random intercept and random slope model with week and subject as the random effects. The association of week with the bprs score is similar to that in the previous models (t value -7.433). The likelihood ratio test gives a chi-squared value of 7.2721 and a small p-value of 0.02636. Therefore the random intercept and random slope model appears to provide a superior fit compared to the random intercept model. The outputs are given below.
# create a random intercept and random slope model
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = TRUE)
summary(BPRS_ref1)
## Linear mixed model fit by REML ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
## Data: BPRSL
##
## REML criterion at convergence: 2727.2
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8836 -0.6117 -0.0651 0.5423 3.7925
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 69.144 8.315
## week 1.052 1.026 -0.52
## Residual 97.736 9.886
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.1568 21.538
## week -2.2704 0.3055 -7.433
## treatment2 0.5722 1.0421 0.549
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.583
## treatment2 -0.242 0.000
# perform an ANOVA test on the two models
anova(BPRS_ref1, BPRS_ref)
## refitting model(s) with ML (instead of REML)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_ref 5 2748.7 2768.1 -1369.4 2738.7
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Finally, let’s add the interaction of week and treatment to the model, creating a random intercept and random slope model with interaction. Comparing the two models, the random intercept and random slope model with interaction (BPRS_ref2) does not provide a better fit than the corresponding model without the interaction as an explanatory variable (BPRS_ref1). This is evidenced by the small chi-squared value of 3.1712 and the high p-value of 0.07495. The outputs of this analysis are shown below.
# create a random intercept and random slope model with the interaction
library(lme4)
BPRS_ref2 <- lmer(bprs ~ week + treatment + (week | subject) + (week*treatment), data = BPRSL, REML = FALSE)
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject) + (week * treatment)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2744.3 2775.4 -1364.1 2728.3 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0512 -0.6271 -0.0768 0.5288 3.9260
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.9964 8.0620
## week 0.9687 0.9842 -0.51
## Residual 96.4707 9.8220
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.2521 21.262
## week -2.6283 0.3589 -7.323
## treatment2 -2.2911 1.9090 -1.200
## week:treatment2 0.7158 0.4010 1.785
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.650
## treatment2 -0.424 0.469
## wek:trtmnt2 0.356 -0.559 -0.840
# perform an ANOVA test on the two models
anova(BPRS_ref2, BPRS_ref1)
## refitting model(s) with ML (instead of REML)
## Data: BPRSL
## Models:
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## BPRS_ref2: bprs ~ week + treatment + (week | subject) + (week * treatment)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4
## BPRS_ref2 8 2744.3 2775.4 -1364.1 2728.3 3.1712 1 0.07495 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here we plot the bprs scores, first as observed and then using the fitted values from the final model. The descending trend over the weeks is more apparent in the plot of fitted values.
# draw the plot of BPRSL with the observed bprs score values
ggplot(BPRSL, aes(x = week, y = bprs, group = subject)) +
  geom_smooth() +
  scale_x_continuous(name = "weeks", breaks = seq(0, 8, 2)) +  # weeks run 0-8, not 0-60
  scale_y_continuous(name = "bprs") +
  theme(legend.position = "top")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
# Create a vector of the fitted values
Fitted <- fitted(BPRS_ref2)
library(dplyr)
library(tidyr)
# Add the fitted values as a new column to BPRSL
BPRSL <- BPRSL %>%
  mutate(Fitted = Fitted)
# draw the plot of BPRSL with the fitted bprs values
library(ggplot2)
ggplot(BPRSL, aes(x = week, y = Fitted, group = subject)) +
  geom_smooth() +
  scale_x_continuous(name = "weeks", breaks = seq(0, 8, 2)) +  # weeks run 0-8, not 0-60
  scale_y_continuous(name = "Fitted bprs") +
  theme(legend.position = "top")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
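As a sketch of an alternative presentation (not part of the analysis above), the observed and fitted trajectories can be reshaped into long form and shown side by side in one figure, which makes the comparison between the two plots more direct. The toy data frame below is hypothetical; with the real data you would use `BPRSL` and its `Fitted` column instead:

```r
library(ggplot2)
library(tidyr)

# Hypothetical miniature stand-in for BPRSL with a Fitted column
toy <- data.frame(
  subject = rep(1:2, each = 3),
  week    = rep(0:2, times = 2),
  bprs    = c(50, 48, 45, 40, 41, 38),
  Fitted  = c(49, 47, 45, 41, 40, 39)
)

# Stack observed and fitted scores into one column, labelled by type
toy_long <- pivot_longer(toy, cols = c(bprs, Fitted),
                         names_to = "type", values_to = "score")

# One panel of observed lines, one panel of fitted lines
p <- ggplot(toy_long, aes(x = week, y = score, group = subject)) +
  geom_line() +
  facet_wrap(~ type)
```

Faceting on the observed/fitted label keeps both panels on the same axes, so differences in slope are easier to judge than across two separate plots.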